Why we are doing this study?

This is a study case used as a capstone for the Google Data Analytics certificate

Problem: My favorite activity in the world to do with my cat is sleeping. Could I optimize it by controlling the different variables that might affect it?

Solution: Let’s dive into a sleeping study to discover which variables will improve or ruin a good night sleep!

Statistical Information

library(tidyverse)
library(dplyr)
library(ggplot2)
library(patchwork)
library(ggpubr)
library(GGally)
library(caret)
data <- read.csv("sleepdata.csv")
head(data)
##   Person.ID Gender Age           Occupation Sleep.Duration Quality.of.Sleep
## 1         1   Male  27    Software Engineer            6.1                6
## 2         2   Male  28               Doctor            6.2                6
## 3         3   Male  28               Doctor            6.2                6
## 4         4   Male  28 Sales Representative            5.9                4
## 5         5   Male  28 Sales Representative            5.9                4
## 6         6   Male  28    Software Engineer            5.9                4
##   Physical.Activity.Level Stress.Level BMI.Category Blood.Pressure Heart.Rate
## 1                      42            6   Overweight         126/83         77
## 2                      60            8       Normal         125/80         75
## 3                      60            8       Normal         125/80         75
## 4                      30            8        Obese         140/90         85
## 5                      30            8        Obese         140/90         85
## 6                      30            8        Obese         140/90         85
##   Daily.Steps Sleep.Disorder
## 1        4200           None
## 2       10000           None
## 3       10000           None
## 4        3000    Sleep Apnea
## 5        3000    Sleep Apnea
## 6        3000       Insomnia
summary(data)
##    Person.ID         Gender               Age         Occupation       
##  Min.   :  1.00   Length:374         Min.   :27.00   Length:374        
##  1st Qu.: 94.25   Class :character   1st Qu.:35.25   Class :character  
##  Median :187.50   Mode  :character   Median :43.00   Mode  :character  
##  Mean   :187.50                      Mean   :42.18                     
##  3rd Qu.:280.75                      3rd Qu.:50.00                     
##  Max.   :374.00                      Max.   :59.00                     
##  Sleep.Duration  Quality.of.Sleep Physical.Activity.Level  Stress.Level  
##  Min.   :5.800   Min.   :4.000    Min.   :30.00           Min.   :3.000  
##  1st Qu.:6.400   1st Qu.:6.000    1st Qu.:45.00           1st Qu.:4.000  
##  Median :7.200   Median :7.000    Median :60.00           Median :5.000  
##  Mean   :7.132   Mean   :7.313    Mean   :59.17           Mean   :5.385  
##  3rd Qu.:7.800   3rd Qu.:8.000    3rd Qu.:75.00           3rd Qu.:7.000  
##  Max.   :8.500   Max.   :9.000    Max.   :90.00           Max.   :8.000  
##  BMI.Category       Blood.Pressure       Heart.Rate     Daily.Steps   
##  Length:374         Length:374         Min.   :65.00   Min.   : 3000  
##  Class :character   Class :character   1st Qu.:68.00   1st Qu.: 5600  
##  Mode  :character   Mode  :character   Median :70.00   Median : 7000  
##                                        Mean   :70.17   Mean   : 6817  
##                                        3rd Qu.:72.00   3rd Qu.: 8000  
##                                        Max.   :86.00   Max.   :10000  
##  Sleep.Disorder    
##  Length:374        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
#Number of null values per column
colSums(is.na(data))
##               Person.ID                  Gender                     Age 
##                       0                       0                       0 
##              Occupation          Sleep.Duration        Quality.of.Sleep 
##                       0                       0                       0 
## Physical.Activity.Level            Stress.Level            BMI.Category 
##                       0                       0                       0 
##          Blood.Pressure              Heart.Rate             Daily.Steps 
##                       0                       0                       0 
##          Sleep.Disorder 
##                       0
length(unique(data$Person.ID)) == nrow(data)
## [1] TRUE
gen_avg <- data %>% 
  group_by(Gender) %>%
  summarise(mean_age = mean(Age), n = n())

gen_avg
## # A tibble: 2 × 3
##   Gender mean_age     n
##   <chr>     <dbl> <int>
## 1 Female     47.4   185
## 2 Male       37.1   189

The average age of Women is higher. This could skew the data as age could be an important factor in sleep quality. But we do have a 50/50 gender representation.

Exploratory Data Analysis (EDA)

data$Sleep.Disorder = factor(data$Sleep.Disorder, levels = c('None','Insomnia','Sleep Apnea'))

df <- data %>% 
  group_by(Sleep.Disorder) %>% # Variable to be transformed
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))

ggplot(df, aes(x = "", y = perc, fill = Sleep.Disorder)) +
  geom_col() +
  geom_label(aes(label = labels), color = "black",
            position = position_stack(vjust = 0.5),
            show.legend = FALSE) +
  scale_fill_brewer(palette = "Reds") +  
  coord_polar("y", start = 0) +  
  theme_void() +
  ggtitle("Percentage of Sleep Disorders") +
  theme(plot.title = element_text(hjust=0.5))

We do not have an equal representation of every sleep class, but we almost have a 60/40 representation of healthy and disease respectively.

my_comparisons <- list( c("None", "Insomnia"), 
                        c("Insomnia", "Sleep Apnea"), 
                        c("None", "Sleep Apnea") )

ggplot(data, aes(x=Sleep.Disorder, y=Quality.of.Sleep, fill=Sleep.Disorder)) +
  geom_boxplot()+
  geom_jitter(shape=16, position=position_jitter(0.2)) +
  scale_fill_brewer(palette="Reds") +
  stat_summary(fun = "mean", geom = "point", shape = 8,
               size = 2, color = "white")+ 
  stat_compare_means(comparisons = my_comparisons)+ # Add pairwise comparisons p-value
  stat_compare_means(label.y = 12) + # Add global p-value
  ggtitle("Quality fo Sleep score distribution \n per Sleep Disorder") +
  theme(plot.title = element_text(hjust=0.5))     

The average quality of aleep score for a person without a disorder is ~7.5. Sleep apnea does not seem to be correlated with a lower quality of sleep, we can see no significant difference between the Sleep Apnea and No Disorder classes. Insomnia is the only class where we see a significant loss in quality of sleep. The quality of sleep metric seems to be very subjective as it does not definitely indicate a sleep disorder or not.

gend_group <- data %>%
  group_by(Gender,Sleep.Disorder,Age)

gender_bar <- ggplot(data, aes(x=Gender, y=Age, fill=Sleep.Disorder)) +
  scale_fill_brewer(palette="Reds") +
  geom_bar(stat="summary", fun.y = "mean", position="dodge") +
  ggtitle("Sleep Disorders per Gender")+
  theme(plot.title = element_text(hjust=0.5))     

gender_bar 

As stated before, the female group has a higher average age and seems more affected by sleep apnea that the male group, while the male group seems the most affected by insomnia.

data$BMI.Category <- gsub("Normal Weight", "Normal", data$BMI.Category)
data$BMI.Category <- factor(data$BMI.Category, labels=c('Normal','Overweight','Obese'))

bmi_bar <- ggplot(data, aes(x=BMI.Category, fill=Sleep.Disorder)) +
  scale_fill_brewer(palette="Reds") +
  geom_bar() +
  ggtitle("BMI effect on Sleep") +
  theme(plot.title = element_text(hjust=0.5))
bmi_bar

We cleaned up this column by merging “Normal” and “Normal Weight” samples. The overweight category seems severely under represented.

A conclusion we can pull from this figure is that BMI is correlated with sleep quality. A higher BMI tends to indicate insomnia and/or sleep apnea, while in the lowest BMI category has a majority of normal sleep. Obese class has almost 50/50 split on both sleep disorders. We cannot make an conclusions on the Overweight category due to the small sample size but it does seem to indicate a correlation with sleep disorders.

data_stress <- data %>%
  group_by(Occupation) %>%
  summarize(avg_stress = mean(Stress.Level),
            avg_age = mean(Age),
            count = n())
  
occup_point <- ggplot(data, aes(x=Occupation, y=Stress.Level)) +
  geom_point(aes(fill=Sleep.Disorder,size=Age), color='black', shape=21, stroke=0.4) +
  scale_fill_brewer(palette="Reds") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5),
        plot.margin = margin(t = 5, r = 10, b = 5, l = 10),
        axis.title.x = element_text(margin = margin(t = 20)),
        plot.title = element_text(hjust=0.5)) +
  ggtitle("Occupation, Age and Stress \n effect on Sleep") 

occup_bar <- ggplot(data, aes(x=Occupation, fill=Sleep.Disorder)) +
  geom_bar() +
  scale_fill_brewer(palette="Reds")+
  ggtitle("Occupation \n effect on Sleep") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.7),
        plot.margin = margin(t = 5, r = 10, b = 5, l = 10),
        axis.title.x = element_text(margin = margin(t = 20)),
        plot.title = element_text(hjust=0.5)) 

print(data_stress)
## # A tibble: 11 × 4
##    Occupation           avg_stress avg_age count
##    <chr>                     <dbl>   <dbl> <int>
##  1 Accountant                 4.59    39.6    37
##  2 Doctor                     6.73    32.7    71
##  3 Engineer                   3.89    46.6    63
##  4 Lawyer                     5.06    39.4    47
##  5 Manager                    5       45       1
##  6 Nurse                      5.55    51.8    73
##  7 Sales Representative       8       28       2
##  8 Salesperson                7       43.5    32
##  9 Scientist                  7       33.5     4
## 10 Software Engineer          6       31.2     4
## 11 Teacher                    4.53    41.7    40
occup_bar + occup_point + plot_layout(widths = c(1, 1))

The occupations with the highest stress scores seem to be the most correlated with sleep disorders, as illustrated by Sales representative, Salespersons and Scientists, especially causing sleep apnea. The occupations with a lower average stress score but a higher age group average, like nurses and teachers, also show an increase in the number of sleep disorders, even when the stress score is on the low end (see Nurses with score of 3).

BP_ranges = c('Normal', 'Elevated', 'Hypertension_1', 'Hypertension_2', 'Hypertensive_Crisis')
BP_systolic_limits = list(c(0,120),c(120,130),c(130,140),c(140,180),c(180,200))
BP_diastolic_limits = list(c(0,80),c(0,80),c(80,90),c(90,120),c(120,140))

# Categorize into BP ranges depending on the systolic and diastolic values 
data <- data %>%
  separate(Blood.Pressure, into=c('Systolic','Diastolic'), sep="/") %>%
  mutate(Systolic = as.numeric(Systolic),
         Diastolic = as.numeric(Diastolic)) %>%
  mutate(BP_range = case_when(
    between(Systolic, BP_systolic_limits[[1]][1], BP_systolic_limits[[1]][2]) & 
      between(Diastolic, BP_diastolic_limits[[1]][1], BP_diastolic_limits[[1]][2]) ~ BP_ranges[1],
    
    between(Systolic, BP_systolic_limits[[2]][1], BP_systolic_limits[[2]][2]) & 
      between(Diastolic, BP_diastolic_limits[[2]][1], BP_diastolic_limits[[2]][2]) ~ BP_ranges[2],
    
    between(Systolic, BP_systolic_limits[[3]][1], BP_systolic_limits[[3]][2]) & 
      between(Diastolic, BP_diastolic_limits[[3]][1], BP_diastolic_limits[[3]][2]) ~ BP_ranges[3],
    
    between(Systolic, BP_systolic_limits[[4]][1], BP_systolic_limits[[4]][2]) & 
      between(Diastolic, BP_diastolic_limits[[4]][1], BP_diastolic_limits[[4]][2]) ~ BP_ranges[4],
    
    between(Systolic, BP_systolic_limits[[5]][1], BP_systolic_limits[[5]][2]) & 
      between(Diastolic, BP_diastolic_limits[[5]][1], BP_diastolic_limits[[5]][2]) ~ BP_ranges[5],
    
    TRUE ~ "Unknown" # Default case
  ))


# New values to replace "Unknown"
replacement_values <- c(rep("Elevated", 12), rep("Hypertension_1", 2), "Elevated")

data <- data %>%
  mutate(BP_range = replace(BP_range, BP_range == "Unknown", replacement_values)) %>%
  mutate(Heart.Rate = as.factor(Heart.Rate))

data$Heart.Rate <- as.numeric(as.character(data$Heart.Rate))
# Group by BP_range and Heart.Rate, then count occurrences
data_grouped <- data %>%
  group_by(BP_range, Heart.Rate, Sleep.Disorder, Age) %>%
  summarise(Count = n(), .groups = "drop")  # Count occurrences of each Heart Rate

# Convert BP_range into a factor with the specified order
data_grouped$BP_range <- factor(data_grouped$BP_range, levels = BP_ranges)

blood_bar <- ggplot(data_grouped, aes(x = BP_range, y = Count, fill = Heart.Rate)) +
  geom_col() +  
  scale_fill_viridis_c(option = "magma", direction = -1) +  
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+
  ggtitle("Heart rate of \n of blood pressure ranges") +
  theme(plot.title = element_text(hjust=0.5))

blood_bar

We separated the “Blood.Pressure” column into “Systolic” (#/) and “Diastolic” (/#) categories and then depending on their combination create a categorical blood pressure range column: “BP_range” from the American Heart Association . We manually clean up the “Unknown” ranges by making assumptions (putting more importance on the systolic number) and assigning them a category.

As expected, we can see an increase in heart rate with each blood pressure range.

data_grouped$Sleep.Disorder = factor(data_grouped$Sleep.Disorder, levels = 
                                       c('None','Insomnia','Sleep Apnea'))

blood_sleep <- ggplot(data_grouped, aes(x=BP_range, y=Heart.Rate)) + 
  geom_jitter(aes(fill=Sleep.Disorder,size=Age), color='black', 
              shape=21, stroke=0.4, width=0.3) + 
  scale_fill_brewer(palette="Reds") +
  ggtitle("Effect of BP range, Heart Rate \n and Age on Sleep") +
  theme(plot.title = element_text(hjust=0.5))

blood_sleep

We see an evident correlation between hypertension and sleep disorders. Regardless of age, a lower heart rate and a normal BP range indicates a lower propensity to sleep disorders, while the hypertension I and II have a higher proportion of insomnia and sleep apnea, with hypertension II being almost exclusively being sleep apnea. An elevated BP does not seem correlated with sleep disorders.

data <- data %>% 
  mutate(Stress.Level = as.factor(Stress.Level))

act_bar <- ggplot(data, aes(x=Physical.Activity.Level,y=Daily.Steps)) +
  geom_jitter(aes(size=Age, fill=Stress.Level), color='black', shape=21, stroke=0.4) +
  scale_fill_brewer(palette="Blues") +
  geom_smooth(method=lm,  linetype="dashed",
              color="darkblue") +
  ggtitle("Effect of Physical Activity, \n Daily Steps and Age on Stress") +
  theme(plot.title = element_text(hjust=0.5))

act_stress_bar <- ggplot(data, aes(x=Physical.Activity.Level,y=Daily.Steps)) +
  geom_jitter(aes(fill=Sleep.Disorder, size=Age), color='black', shape=21, stroke=0.4) +
  scale_fill_brewer(palette="Reds") +
  geom_smooth(method=lm,  linetype="dashed",
              color="darkred")+
  ggtitle("Effect of Physical Activity, \n Daily Steps and Age on Sleep") +
  theme(plot.title = element_text(hjust=0.5))

act_bar + act_stress_bar + plot_layout(widths = c(1, 1))

We see a really strong correlation between physical activity and daily steps with the stress level, the lower the physical activity and daily steps, the higher the stress level the person feels regardless of their age. We see a couple of outliers where the stress levels are high at the top right (highest number of steps with highest activity level). As seen by the following table:

data %>%
  group_by(Occupation, Stress.Level) %>%
  summarise(avg_steps = mean(Daily.Steps),
            avg_activity = mean(Physical.Activity.Level))
## # A tibble: 32 × 4
## # Groups:   Occupation [11]
##    Occupation Stress.Level avg_steps avg_activity
##    <chr>      <fct>            <dbl>        <dbl>
##  1 Accountant 3                7500          80  
##  2 Accountant 4                7000          60  
##  3 Accountant 6                7200          53.3
##  4 Accountant 7                6000          45  
##  5 Doctor     3                6850          87.5
##  6 Doctor     5                3500          65  
##  7 Doctor     6                8000          75  
##  8 Doctor     8                5848.         31.8
##  9 Engineer   3                5176.         30.9
## 10 Engineer   4                5644.         63.9
## # ℹ 22 more rows

The higher number of steps could be a result of the job and not a result of leisure sport, especially with nurses who have the highest number of steps over all the other occupations (8058 steps).

Insomnia seems to be correlated with a lower level of physical, while sleep apnea does not follow this pattern. The latter one seems to mainly affect older people regardless of the physical activity.

sleep_q <- ggplot(data, aes(x=Sleep.Duration,y=Quality.of.Sleep)) +
  geom_jitter(aes(fill=Sleep.Disorder), color='black', 
              shape=21, stroke=0.4, size=3, width=0.5) +  
  scale_fill_brewer(palette="Reds") +
  ggtitle("Effect of Sleep Duration and \n Quality of Sleep on Sleep Disorders") +
  theme(plot.title = element_text(hjust=0.5))

sleep_q

Quality of sleep seems to be strongly correlated with sleep duration. As expected insomnia is correlated with a lower sleep duration and quality of sleep. Sleep apnea does not seem to be correlated with neither sleep duration nor sleep quality, we find sleep anea in both high and low quality of sleep score ranges as well as short and long sleep duration. As quality of sleep is a subjective measure if the sleep apnea is mild it might not affect the person itself (maybe only those around them lol).

library(corrplot)

# Compute correlation matrix (only for numeric variables)
numeric_data <- data[sapply(data, is.numeric)]
numeric_data <- numeric_data[2:ncol(numeric_data)]
cor_matrix <- cor(numeric_data, use="complete.obs")

# Visualize using corrplot
corrplot(cor_matrix, method="color", type="upper", 
         col= colorRampPalette(c("blue", "white", "red"))(200),
         tl.col="black", tl.srt=45)  # Adjust label angle

Conclusions

We have a variety of variables affecting sleep disorders. Some of them are inevitable such as age or gender. Older folks and women seem to have a higher correlation with sleep apnea and insomnia. However some of the variables we have a say over, such as BMI, physical activity, stress levels and occupation (maybe this last one is not so easy a change). All of the aforementioned variables are also strongly correlated with each other. Higher physical activity also means lower stress and blood pressure ,in addition we could hypothesize a lower BMI. Improving all these variables would inversely correlate with sleep disorders.

The goal of this study case was to use data analytics to derive insights, however I want to also apply some machine learning to obtain an order of feature importance and to see if this data is enough to predict a sleep disorder and to push myself to perform some ML in R instead of my usual python.

Pre-processing data

df <- data %>%
  select(!c(Person.ID)) %>%
  mutate(
    Gender = if_else(data$Gender == "Female", 0, 1), #Female=0, Male=1,
    Occupation = as.integer(factor(data$Occupation)),
    BMI.Category = as.integer(factor(data$BMI.Category)),
    BP_range = as.integer(factor(BP_range)),
    Stress.Level = as.integer(Stress.Level)
  )

We now check the distributions of the numeric variables that will maybe need to be standardized: Sleep Duration, Quality of Sleep, Physical Activity,

age <- ggplot(data, aes(x=Age)) + geom_histogram() 
sleepdur <- ggplot(data, aes(x=Sleep.Duration)) + geom_histogram() 
sleepqual <- ggplot(data, aes(x=Quality.of.Sleep)) + geom_histogram() 

age+sleepdur+sleepqual+ plot_layout(widths = c(1, 1,1))

phy <- ggplot(data, aes(x=Physical.Activity.Level)) + geom_histogram() 
stress <- ggplot(data, aes(x=Stress.Level)) + geom_histogram(stat="count") 
sys <- ggplot(data, aes(x=Systolic)) + geom_histogram() 

phy+stress+sys+plot_layout(widths = c(1, 1,1))

dia <- ggplot(data, aes(x=Diastolic)) + geom_histogram() 
hr <- ggplot(data, aes(x=Heart.Rate)) + geom_histogram() 
ds <- ggplot(data, aes(x=Daily.Steps)) + geom_histogram() 

dia+hr+ds+ plot_layout(widths = c(1, 1,1))

We have really uneven distributions for the data, but due to the small number of samples I do not want to manipulate the data too much and skew the results. We will use a random forrest model that does not require standardisation of the input data.

set.seed(123)  # For reproducibility
split <- createDataPartition(df$Sleep.Disorder, p = 0.8, list = FALSE)

train_data <- df[split, ]
test_data  <- df[-split, ]
# Convert target variable to factor (for classification)
train_data$Sleep.Disorder <- as.factor(train_data$Sleep.Disorder)
test_data$Sleep.Disorder  <- as.factor(test_data$Sleep.Disorder)

# Set up training control with cross-validation
train_control <- trainControl(method = "cv", number = 5)  

# Train Random Forest model
model <- train(Sleep.Disorder ~ ., data = train_data,
               method = "rf",  
               trControl = train_control,
               tuneLength = 5)  

# View model details
print(model)
## Random Forest 
## 
## 301 samples
##  13 predictor
##   3 classes: 'None', 'Insomnia', 'Sleep Apnea' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 241, 240, 241, 242, 240 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9036445  0.8303087
##    4    0.9036445  0.8303087
##    7    0.9036445  0.8309179
##   10    0.9070344  0.8367769
##   13    0.8938085  0.8105778
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
# Make predictions
predictions <- predict(model, newdata = test_data)

# Evaluate accuracy
conf_matrix <- confusionMatrix(predictions, test_data$Sleep.Disorder)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    None Insomnia Sleep Apnea
##   None          41        2           0
##   Insomnia       2       12           1
##   Sleep Apnea    0        1          14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9178          
##                  95% CI : (0.8296, 0.9692)
##     No Information Rate : 0.589           
##     P-Value [Acc > NIR] : 3.722e-10       
##                                           
##                   Kappa : 0.8554          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: None Class: Insomnia Class: Sleep Apnea
## Sensitivity               0.9535          0.8000             0.9333
## Specificity               0.9333          0.9483             0.9828
## Pos Pred Value            0.9535          0.8000             0.9333
## Neg Pred Value            0.9333          0.9483             0.9828
## Prevalence                0.5890          0.2055             0.2055
## Detection Rate            0.5616          0.1644             0.1918
## Detection Prevalence      0.5890          0.2055             0.2055
## Balanced Accuracy         0.9434          0.8741             0.9580

Conclusions

The model can predict a sleep disorder with a 93% accuracy. We do note an imbalance in sensitivity between the different groups but as seeing as there is a very imbalanced number of samples for each group this was to be expected.

Dr. Pomelo revising my script.